Airbnb has become a very controversial topic over the last 10 years, especially in big urban touristic places like Paris. Indeed the number of rental per year has been multiplied by 350 between 2011 and 2015 only in Paris. On the one hand, it allowed a better distribution of tourists in the city, but on the other hand, more conventional hospitality companies have been complaining, leading to news legislation. The expansion of the Airbnb platform has undoubtedly shifted the way tourists travel, and consider cities has a tourist area. It is thus a very interesting topic of investigation as it engenders a large amount of insightful data.
In this project we will explore a dataset of Paris Airbnb data. We will focus our exploration on “Paris intramuros” (inside the “Boulevard péripherique”) Airbnb listing, for the years 2010 to 2016. In a first part, we will discover the dataset and its different variables, select the most insightful variables and give it a good cleaning. Once our dataset is nice and tidy, we will explore the different variables distribution in a second part. In a third part,using the data we got, we will try to answer some questions as :
Eventually, we will discuss the different insights we got and their limitation.
In this exploratory data analysis, we will mainly use the dplyr library to tidy and wrangle our data, and ggplot2 to plot this data. We thus need to load all the library used in this analysis
library(dplyr)
library(ggplot2)
library(ggthemes)
library(zoo)
library(lubridate)
library(viridis)
library(plotly)
library(ggmap)
library(ggpubr)
library(cowplot)
library(GGally)
library(ggrastr)
library(DescTools)
library(stringr)
library(ggridges)
load("AirBnB.RData")
The L dataset has 52725 observations and 95 variables. The function View() is used in the first place to have a general look over the set and make a first selection of interesting variables, that have less than 5% of NA values. Here, we will just display the whole dataset variables name.
colnames(L)
## [1] "id" "listing_url"
## [3] "scrape_id" "last_scraped"
## [5] "name" "summary"
## [7] "space" "description"
## [9] "experiences_offered" "neighborhood_overview"
## [11] "notes" "transit"
## [13] "access" "interaction"
## [15] "house_rules" "thumbnail_url"
## [17] "medium_url" "picture_url"
## [19] "xl_picture_url" "host_id"
## [21] "host_url" "host_name"
## [23] "host_since" "host_location"
## [25] "host_about" "host_response_time"
## [27] "host_response_rate" "host_acceptance_rate"
## [29] "host_is_superhost" "host_thumbnail_url"
## [31] "host_picture_url" "host_neighbourhood"
## [33] "host_listings_count" "host_total_listings_count"
## [35] "host_verifications" "host_has_profile_pic"
## [37] "host_identity_verified" "street"
## [39] "neighbourhood" "neighbourhood_cleansed"
## [41] "neighbourhood_group_cleansed" "city"
## [43] "state" "zipcode"
## [45] "market" "smart_location"
## [47] "country_code" "country"
## [49] "latitude" "longitude"
## [51] "is_location_exact" "property_type"
## [53] "room_type" "accommodates"
## [55] "bathrooms" "bedrooms"
## [57] "beds" "bed_type"
## [59] "amenities" "square_feet"
## [61] "price" "weekly_price"
## [63] "monthly_price" "security_deposit"
## [65] "cleaning_fee" "guests_included"
## [67] "extra_people" "minimum_nights"
## [69] "maximum_nights" "calendar_updated"
## [71] "has_availability" "availability_30"
## [73] "availability_60" "availability_90"
## [75] "availability_365" "calendar_last_scraped"
## [77] "number_of_reviews" "first_review"
## [79] "last_review" "review_scores_rating"
## [81] "review_scores_accuracy" "review_scores_cleanliness"
## [83] "review_scores_checkin" "review_scores_communication"
## [85] "review_scores_location" "review_scores_value"
## [87] "requires_license" "license"
## [89] "jurisdiction_names" "instant_bookable"
## [91] "cancellation_policy" "require_guest_profile_picture"
## [93] "require_guest_phone_verification" "calculated_host_listings_count"
## [95] "reviews_per_month"
Here is the list of variable we can keep for further analysis :
L2<-L%>%
select(c('id','name','description','host_id','host_name','host_is_superhost','host_total_listings_count','neighbourhood_cleansed','zipcode','latitude','longitude','property_type','room_type','accommodates','bathrooms','bedrooms','beds','bed_type','price','number_of_reviews'))
Now, we can have a closer look at each selected variable.
summary(L2)
## id name
## Min. : 2623 Charming flat in the heart of Paris: 39
## 1st Qu.: 3470301 Appartement au coeur de Paris : 25
## Median : 6965852 Cosy flat in the heart of Paris : 23
## Mean : 7069608 Studio : 23
## 3rd Qu.:10740059 Cosy studio in the heart of Paris : 21
## Max. :13819560 Charmant studio au coeur de Paris : 20
## (Other) :52574
## description
## Mon logement est parfait pour les couples, les voyageurs en solo et les voyageurs d'affaires. : 31
## Mon logement est parfait pour les couples et les voyageurs en solo. : 15
## Hello!^^ : 14
## Mon logement est parfait pour les couples, les voyageurs en solo, les voyageurs d'affaires et les familles (avec enfants). : 12
## Ideally located between Opéra and Saint Lazare , this completely renovated â\200œHausmannianâ\200\235 apartment has been brought to its former glory and enjoys one of the prime locations in the heart of Paris.: 9
## Mon logement est parfait pour les voyageurs en solo. : 9
## (Other) :52635
## host_id host_name host_is_superhost
## Min. : 2626 Marie : 583 : 46
## 1st Qu.: 6158190 Nicolas : 436 f:50513
## Median :15885410 Pierre : 418 t: 2166
## Mean :22485601 Caroline: 388
## 3rd Qu.:34348717 Anne : 387
## Max. :81397049 Sophie : 372
## (Other) :50141
## host_total_listings_count neighbourhood_cleansed zipcode
## Min. : 0.00 Buttes-Montmartre : 6025 75018 : 5973
## 1st Qu.: 1.00 Popincourt : 4883 75011 : 4825
## Median : 1.00 Vaugirard : 3878 75015 : 3799
## Mean : 5.83 Batignolles-Monceau: 3603 75010 : 3511
## 3rd Qu.: 2.00 Entrepôt : 3466 75017 : 3465
## Max. :1024.00 Passy : 3074 75020 : 2859
## NA's :46 (Other) :27796 (Other):28293
## latitude longitude property_type
## Min. :48.81 Min. :2.221 Apartment :50663
## 1st Qu.:48.85 1st Qu.:2.323 Loft : 567
## Median :48.86 Median :2.347 House : 537
## Mean :48.86 Mean :2.344 Bed & Breakfast: 394
## 3rd Qu.:48.88 3rd Qu.:2.369 Condominium : 266
## Max. :48.91 Max. :2.475 Other : 122
## (Other) : 176
## room_type accommodates bathrooms bedrooms
## Entire home/apt:45177 Min. : 1.000 Min. :0.00 Min. : 0.000
## Private room : 7001 1st Qu.: 2.000 1st Qu.:1.00 1st Qu.: 1.000
## Shared room : 547 Median : 2.000 Median :1.00 Median : 1.000
## Mean : 3.051 Mean :1.09 Mean : 1.059
## 3rd Qu.: 4.000 3rd Qu.:1.00 3rd Qu.: 1.000
## Max. :16.000 Max. :8.00 Max. :10.000
## NA's :243 NA's :193
## beds bed_type price number_of_reviews
## Min. : 0.000 Airbed : 35 $60.00 : 3055 Min. : 0.00
## 1st Qu.: 1.000 Couch : 1182 $50.00 : 3047 1st Qu.: 0.00
## Median : 1.000 Futon : 449 $70.00 : 2787 Median : 3.00
## Mean : 1.684 Pull-out Sofa: 5066 $80.00 : 2598 Mean : 12.59
## 3rd Qu.: 2.000 Real Bed :45993 $100.00: 2073 3rd Qu.: 13.00
## Max. :16.000 $90.00 : 2031 Max. :392.00
## NA's :80 (Other):37134
Given this summary, we can see that :
# check the Spearman's rank correlation coefficient between the continuous variable to spot any relationship between 2 variables.
ggcorr(L2,label=TRUE,hjust = 0.75,method=c("pairwise","spearman"))+
ggtitle("Continuous variables correlation matrix before cleaning",subtitle = "Shows a high corrolation between the size related variable")
In this chunk, we will :
L2<-L2%>%
select(-c("host_total_listings_count","host_total_listings_count","neighbourhood_cleansed","beds"))%>%
filter(zipcode %in% c(seq(75001,75020,1),75116),!is.na(bedrooms),!is.na(bathrooms))%>%
mutate(arrondissement=as.factor(substring(zipcode,5-1,5)),pricepernight=(as.numeric(gsub("[\\$,]","",price))))
#remove the price and zipcode column that have no more utility and create a variable price/accommodates ratio
L2<-L2%>%
select(-c('price','zipcode'))%>%
mutate(pricesizeratio=pricepernight/accommodates)
We can now have our first view of the price variable distribution
bp1<-L2%>%
ggplot(aes(y=pricepernight))+geom_boxplot()+coord_flip()+theme_classic()
bp2<-L2%>%
ggplot(aes(y=pricepernight))+geom_boxplot_jitter()+coord_flip()+theme_classic()
ggarrange(bp1,bp2,ncol=1)
summary(L2$pricepernight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 55 75 97 111 6081
Prices are highly concentrated with 75% of observations having a price per night under 111 dollars. We can also observe on the second plot a very high number of outliers with sometimes extreme values (6000 dollars for a night for exemple).
The next step will be to check those outliers and extreme values to make sure there are not errors. I will use for that the View() function to check observation that have a pricepernight under 20 and over 1000 dollars to make sure they are not errors (as a price per month, or other mistakes).
View(L2%>%
filter( pricepernight > 1000))
View(L2%>%
filter(pricepernight<20))
The following observations were identify as errors. Most of them had a price per month in the price variable instead of a price per night. Some of theme had a price of 0 and were identify as mistakes. They will be removed from the dataset. As we don’t need name and description anymore, this variables will also be remove from the dataset.
L2<-L2%>%
filter(!(id %in% c(5180527,3493067,12484402,6897854,4984405,8259190,8874063,9269657,11872027,1324779,957071,9173969,2217395,11640000,7474143,7284676,5046286,6755931)))%>%
select(-c("name","description"))
Now that our dataset is clean, we can create our variable that count the number of accommodations the hosts have in Paris intramuros, and the average price per night they rent their accommodations.
# Create a new dataset grouping by host_id, with 2 new variables : number of accommodation that one host possess, average price of accommodations 1 host own.
Lhost<-L2%>%
group_by(host_id)%>%
summarise(num_of_accomodation=n(),average_rental_price=mean(pricepernight))
#Join the just created Lhost to our main dataset L2
L2<-inner_join(L2,Lhost,"host_id")
From our very clean and beautiful dataset, we will create another dataset : L4 which will allow us to analyze rental periodicity. For that we will use inner_join to join the database that contains the different rentals with their date and the listing database. We will also use the lubridate and zoo to deal with and transform date type data.
L4<-
inner_join(R,L2,by=c('listing_id'='id'))%>%
mutate(date2=as.Date(date),month=as.factor(months(date2)),year=year(date2),monthyear=as.yearmon(date))
From 52725 observations and 95 variables, we have now 51051 observations and 19 variables, ready for further analysis.
In this part, we will mainly use the ggplot2 library to represent graphically the selected variables distributions.
hisplot<-L2%>%
ggplot(aes(x=number_of_reviews))+
geom_histogram(aes(fill=..count..),bins=120)+
theme_classic()+
theme(axis.title.y = element_blank(),
axis.title.x = element_blank(),
legend.position = 'none',)+
scale_fill_gradient(low="blue",high="orange")+
geom_rug(aes(x=number_of_reviews))+ggtitle("Absolute frequency number of reviews distribution")
boxp<-L2%>%
ggplot(aes(x=number_of_reviews))+geom_boxplot(fill="orange")+theme_classic()+theme(legend.position = 'none',axis.text.y=element_blank(),axis.ticks.y = element_blank())
cowplot::plot_grid(hisplot, boxp,
ncol = 1, rel_heights = c(3, 1),
align = 'v', axis = 'lr')
summary(L2$number_of_reviews)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 3.00 12.65 13.00 392.00
The number of review variable has a big range in comparison to its concentration. Indeed, it goes from 0 to 392 reviews, but 75% of listings have less than 13 reviews. Also, 13775 listings have no comments at all.
hisplot<-L2%>%
ggplot(aes(x=pricepernight))+
geom_histogram(aes(fill=..count..),bins=120)+
theme_classic()+
theme(axis.title.y = element_blank(),
axis.title.x = element_blank(),
legend.position = 'none',)+
scale_fill_gradient(low="blue",high="orange")+
geom_rug(aes(x=pricepernight))+ggtitle("Absolute frequency price per night distribution")
boxp<-L2%>%
ggplot(aes(x=pricepernight))+geom_boxplot(fill="orange")+theme_classic()+theme(legend.position = 'none',axis.text.y=element_blank(),axis.ticks.y = element_blank())
cowplot::plot_grid(hisplot, boxp,
ncol = 1, rel_heights = c(3, 1),
align = 'v', axis = 'lr')
We notice straight away that the range is very large, going from 9 to 3608 dollars per night. On the other hand, prices are highly concentrated around the small values, and there is a large amount of outliers and extreme values.
Let´s zoom-in.
hisplot2<-L2%>%
ggplot(aes(x=pricepernight))+
geom_histogram(aes(fill=..count..),bins=120)+
theme_classic()+
theme(axis.title.y = element_blank(),
axis.title.x = element_blank(),
legend.position = 'none',)+
scale_fill_gradient(low="blue",high="orange")+
geom_rug(aes(x=pricepernight))+ggtitle("Absolute frequency price per night distribution")+
coord_cartesian(xlim = c(0,500))
boxp2<-L2%>%
ggplot(aes(x=pricepernight))+
geom_boxplot(fill="orange")+
theme_classic()+
theme(legend.position = 'none',
axis.text.y=element_blank(),
axis.ticks.y = element_blank())+
coord_cartesian(xlim = c(0,500))
cowplot::plot_grid(hisplot2, boxp2,
ncol = 1, rel_heights = c(3, 1),
align = 'v', axis = 'lr')
summary(L2$pricepernight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 55.0 75.0 96.6 111.0 3608.0
Indeed, 50% of observations have a price per night between 55 and 111 dollars, and 95% of observations have a price per night between 9 and 226 dollars per night, 99% under 401 dollars. There is a very high number of outliers and extreme outliers, which correspond to very luxurious accommodations.
L2%>%
ggplot(aes(x=pricepernight,y=..density..))+geom_histogram(aes(fill=..count..),bins=120,show.legend = FALSE)+geom_density()+theme_classic()+
theme(legend.justification = c(1, 1),
legend.position = c(1,1),
legend.background = element_rect(fill = "white", color = "black"))+
scale_fill_gradient(low="blue",high="orange")+coord_cartesian(xlim=c(0,500))+
geom_vline(aes(xintercept = mean(pricepernight),color="mean"),linetype = "dashed", size = 0.6)+
geom_vline(aes(xintercept = median(pricepernight),color="median"),linetype = "dashed", size = 0.6,show.legend = TRUE)+
geom_vline(aes(xintercept = Mode(pricepernight),color="mode"),linetype = "dashed", size = 0.6,show.legend = TRUE)+
guides(fill=FALSE)+
scale_color_manual(name = "averages", values = c(median = "blue", mean = "red",mode="green"))+
ggtitle("probability density function")
# Quantile and skewness measure
quantile(L2$pricepernight,probs=c(0.25,0.75,0.90,0.95,0.99))
## 25% 75% 90% 95% 99%
## 55 111 169 226 401
Kurt(L2$pricepernight)
## [1] 198.0597
Skew(L2$pricepernight)
## [1] 8.870281
We have a higher mean (96.6) than the median (75) higher than the mode (60). Data is highly concentrated around those moments.The probability density function of pricepernight is highly positively skewed. High number of outliers tends to distort the mean. We then must be careful with the mean and standard deviation signification for this variable. We will thus better use median and mode as averages and interquantile ranges to describe variability in this analysis.
We also notice some little bumps in the density which corresponds to rounded values as 120, 150, 200, 250, 300, 350 etc. The hosts tend to set a rounded number as a price per night for their accommodations.
In this part, by exploring the different variables related to accommodations, we got a better idea of what compose the Airbnb offer in Paris. This will help us for this 3rd part in which we will go further into this dataset analysis.
L2%>%
group_by(arrondissement)%>%
mutate(median=median(pricepernight))%>%
ggplot(aes(x=arrondissement,fill=median))+
geom_boxplot(aes(y=pricepernight),alpha=0.7)+
stat_summary(aes(y=pricepernight,color="mean"),fun = "mean", size = 1, geom = "point",shape=23)+
theme_classic()+
theme()+
coord_flip()+
scale_fill_gradient(low="blue", high="orange")+
labs(colour="")
# Zooming in
L2%>%
group_by(arrondissement)%>%
mutate(median=median(pricepernight))%>%
ggplot(aes(x=arrondissement,fill=median))+
geom_boxplot(aes(y=pricepernight),alpha=0.7)+
stat_summary(aes(y=pricepernight,color="mean"),fun = "mean", size = 1, geom = "point",shape=23)+
theme_classic()+
theme()+
coord_flip(ylim=c(0,500))+
scale_fill_gradient(low="blue", high="orange")+
labs(colour="")
# plot of price/night PDF per arrondissement
L2%>%
ggplot(aes(x = pricepernight, y = arrondissement, fill = 0.5 - abs(0.5 - stat(ecdf)))) +
stat_density_ridges(geom = "density_ridges_gradient", calc_ecdf = TRUE) +
scale_fill_viridis_c(option="turbo",name = "Tail probability", direction = -1)+
coord_cartesian(xlim=c(0,400))+
ggtitle("price/night PDF per arrondissement")
We can fully understand from those plots that there exists some differences in price per night distributions from one “arrondissement” to another :
Except for the 16e, “arrondissements” located on the edges of paris (12, 13, 14, 15, 17, 18, 19, 20) along the “peripherique” have lower median price per night and a lower variability in prices (smaller range, interquartile range and sharper PDF). The 19 and 20 “arrondissements”, located on the est side are the cheapest “arrondissement” regarding their median price and have the lowest price variability (smaller interquartile range and sharper PDF). Historically in Paris, the poorest neighborhood are located on the east side (19,18,20,13,12) and the richest on the west side (14,15,16,17), and it is still true now a days. Also, most of the more modest neighborhood don’t have much tourist attraction, except Montmarte in the south of 18eme.
In the middle we have the 10 11 and 9 “arrondissement” which are more central but not fully in Paris center and have intermediate median price and price variability.
On the other hand, very central “arrondissements” (1, 2, 3, 4, 5, 6, 7, 8) and the 16e which is the fanciest neighborhood of Paris and close to the Champs-Élysées and Eiffel Tower, have the highest median price per night and higher price variability (flatter PDF and bigger interquartile range). They also have more outliers and extreme values which tends to separate the mean from the median. We can easily understand why the most expensive “arrondissement” is the 8th, as it is the most beautiful neighborhood, ideally located in the center, between the Champs-Élysées, the Eiffel Tower and the city center.
# Create a new set dividing the observations in groups depending on the price range they belong to
Linter2<-L2%>%
mutate(group=cut(pricepernight,breaks=c(0,50,150,300,700,1500,3700),labels=c("<=50","51-150","151-300","300-700","700-1500",">1500")))
#Density maps of listing according to their price range group
ggmap(paris_map5)+
stat_density2d(
aes(x = longitude, y = latitude, fill = ..level..,alpha=..level..),
bins = 15, data = Linter2,
geom = "polygon"
)+scale_fill_viridis(option="magma")+
ggtitle("Density levels of airbnb accomodations per price per night range")+
theme_classic()+
theme(axis.text = element_blank(),
axis.title = element_blank(),
axis.ticks = element_blank(),
legend.position = "none",
plot.title = element_text(size = 11)
)+
facet_wrap(~ group)
# barplot of number of listings per arrondissements faceted by price range.
Linter2%>%
ggplot(aes(x=arrondissement))+
geom_bar(aes(fill=..count..))+
theme_classic()+
theme(legend.position = "bottom",
axis.text.x = element_text(angle= 90,vjust = 0.5,size = 8 ))+
scale_fill_gradient(low = "blue", high = "orange")+
facet_wrap(~group,scales = "free_y")
The density levels of Airbnb listing and the barplots give us a few insight :
Regarding the location, price per night seems to be correlated to several geographical factor :
bedo<-L2%>%
group_by(factor(bedrooms))%>%
mutate(median=median(pricepernight))%>%
ggplot(aes(x=factor(bedrooms),fill=median))+
geom_boxplot(aes(y=pricepernight),alpha=0.7)+
stat_summary(aes(y=pricepernight,color="mean"),fun = "mean", size = 1, geom = "point",shape=23)+
theme_classic()+theme()+coord_flip(ylim=c(0,2000))+
scale_fill_gradient(low="blue", high="orange")+
labs(x="bedrooms",colour="")
aco<-L2%>%
group_by(factor(accommodates))%>%
mutate(median=median(pricepernight))%>%
ggplot(aes(x=factor(accommodates),fill=median))+
geom_boxplot(aes(y=pricepernight),alpha=0.7)+
stat_summary(aes(y=pricepernight,color="mean"),fun = "mean", size = 1, geom = "point",shape=23)+
theme_classic()+
theme(axis.title.x = element_blank())+
coord_flip(ylim=c(0,2000))+
scale_fill_gradient(low="blue", high="orange")+
labs(x = "Accomodates", colour="")
ggarrange(aco,bedo,nrow = 2,common.legend = TRUE,legend = "right")
The distribution of pricepernight according to the size of the accommodation clearly shows that the bigger, the more expensive. Indeed whether it is the number of people accommodated or the number of rooms, the bigger it gets, the higher is the median price. We also notice that bigger accommodations tend to have higher variability in prices.
L2%>%
group_by(property_type)%>%
mutate(median=median(pricepernight))%>%
ggplot(aes(x=property_type,fill=median))+
geom_boxplot(aes(y=pricepernight),alpha=0.7)+
stat_summary(aes(y=pricepernight,color="mean"),fun = "mean", size = 1, geom = "point",shape=23)+
theme_classic()+
theme()+
coord_flip(ylim=c(0,2000))+
scale_fill_gradient(low="blue", high="orange")+
labs(colour="")
L2%>%
group_by(property_type)%>%
mutate(median=median(pricesizeratio))%>%
ggplot(aes(x=property_type,fill=median))+
geom_boxplot(aes(y=pricesizeratio),alpha=0.7)+
stat_summary(aes(y=pricesizeratio,color="mean"),fun = "mean", size = 1, geom = "point",shape=23)+
theme_classic()+
theme()+
coord_flip(ylim=c(0,300))+
scale_fill_gradient(low="blue", high="orange")+
labs(colour="")
roomo<-L2%>%
group_by(room_type)%>%
mutate(median=median(pricepernight))%>%
ggplot(aes(x=room_type,fill=median))+
geom_boxplot(aes(y=pricepernight),alpha=0.7)+
stat_summary(aes(y=pricepernight,color="mean"),fun = "mean", size = 1, geom = "point",shape=23)+
theme_classic()+
theme()+
coord_flip(ylim=c(0,2000))+
scale_fill_gradient(low="blue", high="orange")+
labs(colour="")
tbed<-L2%>%
group_by(bed_type)%>%
mutate(median=median(pricepernight))%>%
ggplot(aes(x=bed_type,fill=median))+
geom_boxplot(aes(y=pricepernight),alpha=0.7)+
stat_summary(aes(y=pricepernight,color="mean"),fun = "mean", size = 1, geom = "point",shape=23)+
theme_classic()+
theme(axis.title.x = element_blank())+
coord_flip(ylim=c(0,2000))+
scale_fill_gradient(low="blue", high="orange")+
labs(colour="")
ggarrange(tbed,roomo,nrow=2,common.legend = TRUE, legend = "right",align = "v")
L2%>%
ggplot(aes(y=pricepernight,x=number_of_reviews))+
geom_point()+
geom_smooth(formula= y ~ x,se=FALSE)+
theme_classic()
cor(L2$number_of_reviews,L2$pricepernight, method="spearman")
## [1] 0.001932152
The adjusted mean curve suggests that the number of reviews doesn’t have any influence on the price per night. However, the scatter point suggests a higher variability in prices as the number of reviews get smaller.
In this section, we will cross some categorical variables to observe how combined, they can influence the price per night. We will use the median price to compare them, as the high number of outliers and extreme values in some categories tend to distort the mean. We will also use the number of bedrooms to evaluate the size of listings for comparison.
L2%>%
group_by(bedrooms,arrondissement)%>%
summarise(medianprice=median(pricepernight))%>%
ggplot(aes(x=arrondissement,factor(bedrooms)))+
geom_point(aes(size=medianprice,color=medianprice))+
scale_color_continuous(low = "blue", high = "red", breaks = c(50,100,500,1000,2000,5000) )+
scale_size(breaks=c(50,100,500,1000,2000,5000),range = c(1, 11))+
guides(color= guide_legend(), size=guide_legend())+
labs(y="n° of bedrooms")
This plot allows us to confirm that the size indeed influence the price but, also allows us to compare the prices per “arrondissements” for accommodations of the same size. We verify that for the same sizes, listings have a higher median price in some arrondissements than others. Indeed, in the most common size from 0 to 4 bedrooms, prices are always higher in the central “arrondissements” (1,2,3,4,5,6,7,8) and the 16e. On the other hand, 19e 20e 13e and 12e are always the cheapest.
L2%>%
group_by(property_type,arrondissement)%>%
summarise(medianprice=median(pricepernight))%>%
ggplot(aes(x=arrondissement,property_type))+
geom_point(aes(size=medianprice,color=medianprice))+
scale_color_continuous(low = "blue", high = "red", breaks = c(50,100,500,1000,2000,5000) )+
scale_size(breaks=c(50,100,500,1000,2000,5000),range = c(1, 11))+
guides(color= guide_legend(), size=guide_legend())
We can observe the same phenomenon here, whatever the “arrondissement”, some property types are more expensive as Villas. Also, we observe that whatever the type of property, some arrondissements are more expensive, as 8e or 16e.
L2%>%
group_by(property_type,bedrooms)%>%
summarise(medianprice=median(pricepernight))%>%
ggplot(aes(x=factor(bedrooms),property_type))+
geom_point(aes(size=medianprice,color=medianprice))+
scale_color_continuous(low = "blue", high = "red", breaks = c(50,100,500,1000,2000,5000) )+
scale_size(breaks=c(50,100,500,1000,2000,5000),range = c(1, 11))+
guides(color= guide_legend(), size=guide_legend())+
labs(x="n° of bedrooms")
Having the same size, Villas are always more expensive in high proportion. On the other hand, it is quite surprising to see that having the same amount of bedrooms, Apartment, House and town house have quite the same prices and apartment even tend to be more expensive than Houses. Loft and bed&breakfast are more expensive than apartment with comparable size.
This bivariate analysis shows us that the size seems to be the major factor of influence on the price per night, followed by the location and then certain property type.
The correlation matrix can give a idea of the degree of relationship between 2 numerical variables.
# We use the spearman method as the distribution of price per night is not normal, has a lot of outliers and the factors variables are numeric discrete and ranked (as number of bedrooms)
ggcorr(L2,label=TRUE,hjust = 0.75,method = c("pairwise", "spearman"))+
ggtitle("Continuous variables correlation matrix")
The correlation matrix displaying the spearman correlation coefficient suggest that there exists a moderate relationship (0.5 and 0.6 spearman coefficient) between the size related variables accommodates-bedrooms and pricepernight.
To sum up this part regarding the variables that can influence the price per night of a Airbnb listing in Paris, we have observed that the size, location and types are factors that have an influence on the price per night. On the other hand the number of reviews doesn’t seem to have any influence on the price per night. After having explored the variables related to the accommodations and price, We will now focus our attention on the hosts.
chipoli<-L2%>%
filter(host_is_superhost %in% c("f","t"))%>%
dplyr::group_by(host_id)%>%
summarise(superhost=str_count(host_is_superhost,"f"))
chipoli<-chipoli%>%
dplyr::group_by(host_id)%>%
summarise(superhost=mean(superhost))
chipoli%>%
ggplot(aes(x=factor(superhost)))+
geom_bar(aes(fill=..count..),color="black")+
theme_classic()+
theme(legend.position = 'none',
axis.title.x=element_blank(),
axis.title.y=element_blank())+
scale_fill_gradient(low="blue",high="orange")+
ggtitle("Host is a superhost")+
scale_x_discrete(labels=c("True","False"))
Among the 43472 different hosts that own at least 1 Airbnb listing in Paris intramuros, only 1812 are superhosts (4%).
plot1<-L2%>%
dplyr::group_by(host_id)%>%
summarise(num_of_listing=mean(num_of_accomodation))%>%
ggplot()+
geom_histogram(aes(x=num_of_listing,fill=..count..),bins=153)+
geom_rug(aes(x=num_of_listing))+
theme_classic()+
theme(legend.position = "none")+
scale_fill_gradient(low="blue",high = "orange")+
labs(x="number of listings")
plot2<-L2%>%
dplyr::group_by(host_id)%>%
summarise(num_of_listing=mean(num_of_accomodation))%>%
ggplot()+
geom_bar(aes(x=factor(num_of_listing),fill=..count..))+
theme_classic()+theme(legend.position = "none")+
scale_fill_gradient(low="blue",high = "orange")+
coord_cartesian(xlim=c(0,10))+labs(x="number of listings")
ggarrange(plot1,plot2,nrow=2)
The range of number of listing goes from 1 to 153 but the strict majority of hosts only own 1 listing in Paris intramuros. Indeed, 40350 out of the 43472 different host have only 1 listing (93%). 2450 hosts (5.6%) have 2 listings. The 2% remaining are multilisting owners.
Who are this owner that have several accommodations ?
head(L2%>%
group_by(host_id,host_name)%>%
summarise(numacommodation=n())%>%
dplyr::arrange(desc(numacommodation)),50)
## # A tibble: 50 x 3
## # Groups: host_id [50]
## host_id host_name numacommodation
## <int> <fct> <int>
## 1 2288803 Fabien 153
## 2 2667370 Parisian Home 137
## 3 12984381 Olivier 89
## 4 3972699 Hanane 78
## 5 21630783 Pierre 64
## 6 39922748 Clara 64
## 7 3943828 Caroline 62
## 8 789620 Charlotte 56
## 9 11593703 Rudy And Benjamin 56
## 10 152242 Delphine 53
## # ... with 40 more rows
Having a look at the 50 owners that own the highest number of accommodations listed on Airbnb, we notice that most of them are private owners but there are also some companies as “Parisian home” or “My Apartment in Paris” that seems to manage some listings.
Does the fact that owners are superhost or own several listings have an influence on the price per night ?
L2%>%
filter(host_is_superhost %in% c("t","f"))%>%
group_by(factor(host_is_superhost))%>%
mutate(median=median(pricesizeratio))%>%
ggplot(aes(x=factor(host_is_superhost),fill=median))+
geom_boxplot(aes(y=pricesizeratio),alpha=0.7)+
stat_summary(aes(y=pricesizeratio,color="mean"),fun = "mean", size = 1, geom = "point",shape=23)+
theme_classic()+
coord_flip()+
scale_fill_gradient(low="blue", high="orange")+
labs(x = "Host is super host",y="price size ratio", colour="")
L2%>%
dplyr::group_by(host_id)%>%
summarise(num_of_listing=n(),average_price=mean(pricepernight))%>%
ggplot()+
geom_point(aes(x=num_of_listing,y=average_price))+
theme_classic()+
theme(legend.position = "none")+
scale_fill_gradient(low="blue",high = "orange")+
geom_smooth(aes(x=num_of_listing,y=average_price))
# Create a set grouping observations by the number of listings their hosts has
Linter3<-L2%>%
mutate(group=cut(num_of_accomodation,breaks=c(0,1,2,5,15,50,154),labels=c("1","2","3-5","6-15","16-50",">50")))
Linter3%>%
group_by(group)%>%
mutate(median=median(pricepernight))%>%
ggplot(aes(x=group,fill=median))+
geom_boxplot(aes(y=pricepernight),alpha=0.7)+
stat_summary(aes(y=pricepernight,color="mean"),fun = "mean", size = 1, geom = "point",shape=23)+
theme_classic()+
theme()+
coord_flip(ylim = c(0,500))+
scale_fill_gradient(low="blue", high="orange")+
labs(x="n° of listings own by the host", colour="")
# plot of price/night PDF per number of listing owned
Linter3%>%
ggplot(aes(x = pricepernight, y = group, fill = 0.5 - abs(0.5 - stat(ecdf)))) +
stat_density_ridges(geom = "density_ridges_gradient", calc_ecdf = TRUE) +
scale_fill_gradient(low = "blue",high="orange",name="Tail probability")+
coord_cartesian(xlim=c(0,400))+
ggtitle("price/night PDF per number of listing owned")+
labs(y="n° of listings own by the host", colour="")
We could then assume that those hosts who own multiple listings are investors who would invest in the most profitable areas. We can have a look at the location of their listings.
ggmap(paris_map5)+
stat_density2d(
aes(x = longitude, y = latitude, fill = ..level..,alpha=..level..),
bins = 15, data = Linter3,
geom = "polygon"
)+scale_fill_viridis(option="magma")+
ggtitle("Density levels of airbnb accomodations per number of accomodations owned by the hosts")+
theme(axis.text = element_blank(),
axis.title = element_blank(),
axis.ticks = element_blank(),
legend.position = "none",
plot.title = element_text(size = 11)
)+
facet_wrap(~ group)
We can indeed observe that there exists some difference of location according to the number of accommodations own by a host :
Accommodations of hosts that possess 1 or 2 listings are spread all across the city with higher density around 18eme and 11eme just as the global density distribution of the whole dataset.
Accommodations of host that possess more than 5 listings are more concentrated, especially in the extra center(1e and 2e) around le Marais (3e and 4e) and south of the “seine” in the latin neighborhood (5e and 6e “arrondissements”), a bit less around the Champs-Elysees/Tour Eiffel (8e ,16e, 17e “arrondissements”) and a bit of 18e. The density in north east, east, south east outer areas along the “peripherique” (19e, 20e, 12e, 13e, 14e, 15e) are very low.
To sum up, we have learned that the vast majority of host only have one listing and are not superhost. We can assume that this kind of host rent the place where they live. Few of the hosts have several and even a large number of listings for some of them. Some of those multilistings owner are private companies. The fact of being a super host does not have any influence on the price per night. On the other hand, host that have more than 5 listings tends to have them located in the most touristic and profitable neighborhood, and rent them at higher prices.
In this section, the L4 dataset created earlier will be used. This set is joining the list of Airbnb rentals with dates from 2009 to 2006 and the set of listing with information we have dealt with so far.
L4%>%
filter(monthyear!="juil. 2016")%>%
group_by(monthyear)%>%
summarise(numofrental=n())%>%
ggplot(aes(x=monthyear,y=numofrental))+
geom_point(aes(y=numofrental))+
geom_line()+
geom_smooth(se = FALSE)+
theme_classic()+
ggtitle("Evolution of the number of airbnb rental per month",subtitle = "in Paris intramuros from January 2010 to June 2016")
Here, we have removed the month of July 2016 as we do not have the whole month data. What can we learn from this plot ?
Let´s have a closer look at the periodicity
L4%>%
group_by(year,month)%>%
summarize(nummonth=n())%>%
ggplot(aes(x=factor(month,levels=c('janvier','février','mars','avril','mai','juin','juillet','août','septembre','octobre','novembre','décembre')),
y=nummonth))+
geom_line(aes(group=(year),linetype=factor(year),color=factor(year)))+
geom_point(aes(color=factor(year)))+
facet_wrap(~year,scale="free_y")+
theme_classic()+
xlab('Month of the year')+
ylab('n° of rental')+
theme(legend.position = "none",
axis.text.x = element_text(angle = 90,vjust=0.5,hjust=1))
There exists a clear periodicity pattern that can be verified every year :
Number of rental are at the lowest in February.
Then it rises gradually until June-July and either stabilize or go down in August. One could think that there is a mistake in the data as August is the summer holiday month in France an Europe so it should be one of the busiest in terms of rentals. The answer is that foreigner tourists continue to travel to Paris in August at a same level as July and other months, but french tourists visitation of Paris is dropping by a lot in August every year. They must be more interested in other holidays destinations for summertime.
The number of rental then rises again in September to reach its top level every year in October.
It then drops gradually until February an the cycle repeats itself every year.
Is the periodicity the same for all “arrondissements” ?
Let’s take 2014 for instance.
L4%>%
filter(year %in% c("2014"))%>%
group_by(year,month,arrondissement)%>%
summarize(nummonth=n())%>%
ggplot(aes(x=factor(month,levels=c('janvier','février','mars','avril','mai','juin','juillet','août','septembre','octobre','novembre','décembre')),y=nummonth,color=arrondissement))+
geom_line(aes(group=interaction(year,arrondissement),linetype=factor(year)))+
geom_point()+
theme_classic()+
xlab('Month of the year')+
ylab('n° of rental')+
theme(axis.text.x = element_text(angle = -45),
plot.title = element_text(color="black", size=11, face="bold"),
legend.position = "bottom",
legend.background = element_rect(size=0.5,linetype="solid",colour ="black"))+
labs(color="Arrondissement",linetype="Year",shape="Year")+
ggtitle("Num of rental per arrondissement in 2014")
L4%>%
filter(year %in% c("2014"))%>%
group_by(year,month,arrondissement)%>%
summarize(nummonth=n())%>%
ggplot(aes(x=factor(month,levels=c('janvier','février','mars','avril','mai','juin','juillet','août','septembre','octobre','novembre','décembre')),y=nummonth,color=arrondissement))+
geom_line(aes(group=interaction(year,arrondissement),linetype=factor(year)))+
geom_point()+
theme_classic()+
xlab('Month of the year')+
ylab('n° of rental')+
theme(axis.text.x = element_text(angle = 90,vjust = 0.3,hjust=1),
plot.title = element_text(color="black", size=11, face="bold"),
legend.position = "none",
legend.background = element_rect(size=0.5, linetype="solid",colour ="black"))+
labs(color="Arrondissement",linetype="Year",shape="Year")+
facet_wrap(~arrondissement,scales = "free_y")+
ggtitle("Num of rental per arrondissement in 2014")
Those 2 plots shows us that :
This can be verified for every single year, as seen below for 2015.
L4%>%
filter(year %in% c("2015"))%>%
group_by(year,month,arrondissement)%>%
summarize(nummonth=n())%>%
ggplot(aes(x=factor(month,levels=c('janvier','février','mars','avril','mai','juin','juillet','août','septembre','octobre','novembre','décembre')),y=nummonth,color=arrondissement))+
geom_line(aes(group=interaction(year,arrondissement),linetype=factor(year)))+
geom_point()+
theme_classic()+
xlab('Month of the year')+
ylab('n° of rental')+
theme(axis.text.x = element_text(angle = -45),
plot.title = element_text(color="black", size=11, face="bold"),
legend.position = "bottom",
legend.background = element_rect(size=0.5, linetype="solid",colour ="black"))+
labs(color="Arrondissement",linetype="Year",shape="Year")+
ggtitle("Num of rental per arrondissement in 2015")
L4%>%
filter(year %in% c("2015"))%>%
group_by(year,month,arrondissement)%>%
summarize(nummonth=n())%>%
ggplot(aes(x=factor(month,levels=c('janvier','février','mars','avril','mai','juin','juillet','août','septembre','octobre','novembre','décembre')),y=nummonth,color=arrondissement))+
geom_line(aes(group=interaction(year,arrondissement),linetype=factor(year)))+
geom_point()+
theme_classic()+
xlab('Month of the year')+ylab('n° of rental')+
theme(axis.text.x = element_text(angle = 90,vjust = 0.3,hjust=1),
plot.title = element_text(color="black", size=11, face="bold"),
legend.position = "none",
legend.background = element_rect(size=0.5, linetype="solid",colour ="black"),
)+labs(color="Arrondissement",linetype="Year",shape="Year")+
facet_wrap(~arrondissement,scales = "free_y")+
ggtitle("Num of rental per arrondissement in 2015")
Regarding the time analysis, a great rise in the number of rental and a periodicity has been verified over the years between 2010 and 2016. We also conclude that the number of rental varies according to the “arrondissement” and seems to be proportional to the number of listings this “arrondissement” has. Eventually, the periodicity seems to be the same for all the “arrondissements”.
After discovering this Airbnb raw dataset, a good cleaning was given using the dplyr library. Variables of interest were thoroughly selected before starting any analysis. Some graphical analysis with the ggplot library gave us numerous insights about the kind of accommodations we can find on Airbnb in Paris. Most of theme being small entire apartment of 0 to 2 bedrooms with real beds, able to accommodate 1 to 4 people, very spread across the city but with high density around certain spots as Montmartre or le Marais.
The prices analysis gave us some insightful information about the factors that influence or not the price per night of a listing. We discovered that location and size seem to be the factors that influence the most the price per night, followed by the certain types of accommodation. The host analysis allowed to determine that most of them only rent their own living apartment and are not superhost. On the other hand, there exists hosts that own up to 153 listings in Paris in the most profitable places and tend to have higher prices. Some of them are private companies which manage accommodations. Eventually, the rental time analysis shows an insane evolution in the number of rental from 2010 to 2016 going along with a high periodicity of rental in Paris intramuros.
This exploratory data analysis is not exhaustive as most of the analysis were made graphically and interpreting the moments. However it is a great basis for further analysis using inference or model implementation. It goes with the shiny app that allows you to explore this dataset by yourself. To go further, it would be interesting to explore the more resent data up to 2021 to see if there was a shift due to the new legislation on Airbnb in Paris, or have access to more data about the guests.